Abstract

The restricted Boltzmann machine (RBM) is a flexible tool for modeling complex data; however, there have been significant computational difficulties in using RBMs to model high-dimensional multinomial observations. In natural language processing applications, words are naturally modeled by K-ary discrete distributions, where K is determined by the vocabulary size and can easily be in the hundreds of thousands. The conventional approach to training RBMs on word observations is limited because it requires sampling the states of K-way softmax visible units during block Gibbs updates, an operation that takes time linear in K. In this work, we address this issue by employing a more general class of Markov chain Monte Carlo operators on the visible units, yielding updates with computational complexity independent of K. We demonstrate the success of our approach by training RBMs on hundreds of millions of word n-grams using larger vocabularies than previously feasible with RBMs and using the learned features to improve performance on chunking and sentiment classification tasks, achieving state-of-the-art results on the latter.

1 Introduction

The breadth of applications for the restricted Boltzmann machine (RBM) [Smolensky, 1986, Freund and Haussler, 1991] has expanded rapidly in recent years. For example, RBMs have been used to model image patches [Ranzato et al., 2010], text documents as bags of words [Salakhutdinov and Hinton, 2009], and movie ratings [Salakhutdinov et al., 2007], among other data. Although RBMs were originally developed for binary observations, they have been generalized to several other types of data, including integer- and real-valued observations [Welling et al., 2005].

However, one type of data that is not well supported by the RBM is word observations from a large vocabulary (e.g., 100,000 words). The issue is not one of representing such observations in the RBM framework: so-called softmax units [Salakhutdinov and Hinton, 2009] are the natural choice for modeling words. However, manipulating distributions over the states of such units is expensive even for intermediate vocabulary sizes and quickly becomes impractical for vocabulary sizes in the hundreds of thousands, a typical situation for NLP problems. For example, with a vocabulary of 100,000 words, modeling n-gram windows of size n = 5 is comparable in scale to training an RBM on binary vector observations of dimension 500,000 (i.e., more dimensions than a 700 × 700 pixel image). This scalability issue has been a primary obstacle to using the RBM for natural language processing.

In this work, we directly address the scalability issues associated with large softmax visible units in RBMs. We describe a learning rule with a computational complexity independent of the size of the softmax visible units. We obtain this rule by replacing the Gibbs sampling transition kernel over the visible units with carefully implemented Metropolis-Hastings transitions. By training on hundreds of millions of word windows, we are able to learn representations capturing meaningful syntactic and semantic properties of words. Our learned word representations provide benefits on a chunking task competitive with other methods of inducing word representations and our learned n-gram features yield even larger performance gains. Finally, we also show how similarly extracted n-gram representations can be used to obtain state-of-the-art performance on a sentiment classification benchmark.



2 Restricted Boltzmann Machines 



We first describe the restricted Boltzmann machine for binary observations, which provides the basis for other 
data types. An RBM defines a distribution over a binary visible vector v of dimensionality V and a layer h of 
H binary hidden units through an energy 



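$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^\top \mathbf{v} - \mathbf{c}^\top \mathbf{h} - \mathbf{h}^\top \mathbf{W} \mathbf{v}. \tag{1}$$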



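This energy is parameterized by bias vectors $\mathbf{b} \in \mathbb{R}^V$ and $\mathbf{c} \in \mathbb{R}^H$ and weight matrix $\mathbf{W} \in \mathbb{R}^{H \times V}$, and is converted into a probability distribution via

$$p(\mathbf{v}, \mathbf{h}) = \exp(-E(\mathbf{v}, \mathbf{h})) / Z, \qquad Z = \sum_{\mathbf{v}', \mathbf{h}'} \exp(-E(\mathbf{v}', \mathbf{h}')). \tag{2}$$

This yields simple conditional distributions:

$$p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v}), \qquad p(h_j = 1 \mid \mathbf{v}) = \mathrm{sigm}\Big(c_j + \sum_i W_{ji} v_i\Big), \tag{3}$$

$$p(\mathbf{v} \mid \mathbf{h}) = \prod_i p(v_i \mid \mathbf{h}), \qquad p(v_i = 1 \mid \mathbf{h}) = \mathrm{sigm}\Big(b_i + \sum_j W_{ji} h_j\Big), \tag{4}$$

where $\mathrm{sigm}(z) = 1/(1 + e^{-z})$, which allow for efficient Gibbs sampling of each layer given the other layer.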
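The block Gibbs sampler implied by Eqs. (3)-(4) can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation; the function names, the toy sizes, and the random initialization are assumptions made only for the example.

```python
import numpy as np

def sigm(z):
    """Logistic sigmoid, sigm(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sample_hidden(v, W, c, rng):
    """Sample h ~ p(h | v) using Eq. (3); W has shape (H, V)."""
    p = sigm(c + W @ v)
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(h, W, b, rng):
    """Sample v ~ p(v | h) using Eq. (4)."""
    p = sigm(b + W.T @ h)
    return (rng.random(p.shape) < p).astype(float), p

def gibbs_step(v, W, b, c, rng):
    """One full block Gibbs sweep: v -> h -> v'."""
    h, _ = sample_hidden(v, W, c, rng)
    v_new, _ = sample_visible(h, W, b, rng)
    return v_new, h

# Toy usage (sizes chosen only for illustration).
rng = np.random.default_rng(0)
V, H = 6, 4
W = 0.1 * rng.standard_normal((H, V))
b, c = np.zeros(V), np.zeros(H)
v = rng.integers(0, 2, size=V).astype(float)
for _ in range(5):
    v, h = gibbs_step(v, W, b, c, rng)
```

Each layer is sampled in one vectorized step because the units within a layer are conditionally independent given the other layer.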

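We train an RBM from T visible data vectors $\{\mathbf{v}_t\}_{t=1}^T$ by minimizing the scaled negative (in practice, penalized) log likelihood of the parameters $\theta = (\mathbf{b}, \mathbf{c}, \mathbf{W})$:

$$\theta^{\mathrm{MLE}} = \arg\min_\theta \mathcal{L}(\theta), \qquad \mathcal{L}(\theta) = \frac{1}{T} \sum_t \ell(\mathbf{v}_t; \theta), \qquad \ell(\mathbf{v}_t; \theta) = -\log p(\mathbf{v}_t) = -\log \sum_{\mathbf{h}} p(\mathbf{v}_t, \mathbf{h}). \tag{5}$$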



The gradient of the objective with respect to $\theta$ is the difference of two expectations of $\partial E / \partial \theta$: one over the hidden units given each training vector, and one over the joint distribution of the model. It is intractable to compute because of the exponentially many terms in the sum over joint configurations of the visible and hidden units in the second expectation.

Fortunately, for a given $\theta$, we can approximate this gradient by replacing the second expectation with a Monte Carlo estimate based on a set of M samples $\{\mathbf{v}_m\}$ from the RBM's distribution (Eq. (6)).
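Written out in the notation of Eqs. (1)-(5), the gradient and its Monte Carlo estimate take the following standard form. This is derived from the definitions above rather than copied from a display in the text; the number (6) is attached to the estimator because that is how the later sections refer to it.

```latex
\frac{\partial \mathcal{L}(\theta)}{\partial \theta}
  = \frac{1}{T} \sum_{t} \mathbb{E}_{\mathbf{h} \mid \mathbf{v}_t}\!\left[ \frac{\partial E(\mathbf{v}_t, \mathbf{h})}{\partial \theta} \right]
  - \mathbb{E}_{\mathbf{v}, \mathbf{h}}\!\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]

\frac{\partial \mathcal{L}(\theta)}{\partial \theta}
  \approx \frac{1}{T} \sum_{t} \mathbb{E}_{\mathbf{h} \mid \mathbf{v}_t}\!\left[ \frac{\partial E(\mathbf{v}_t, \mathbf{h})}{\partial \theta} \right]
  - \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\mathbf{h} \mid \mathbf{v}_m}\!\left[ \frac{\partial E(\mathbf{v}_m, \mathbf{h})}{\partial \theta} \right]
  \tag{6}
```

The first term follows from differentiating $-\log \sum_{\mathbf{h}} \exp(-E(\mathbf{v}_t, \mathbf{h}))$, and the second from differentiating $\log Z$.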
Figure 1: Illustration of an RBM with binary observations (left) and K-ary observations, for n = 2 and K = 3, i.e. a pair of 3-ary observations (right).

The samples $\{\mathbf{v}_m\}$ are often referred to as "negative samples" as they counterbalance the gradient due to the observed, or "positive" data. To obtain these samples, we maintain M parallel Markov chains throughout learning and update them using Gibbs sampling between parameter updates.

Learning with the Monte Carlo estimator proceeds by alternating between two steps: 1) using the current parameters $\theta$, simulate a few steps of the Markov chain on the M negative samples using Eqs. (3)-(4); and 2) use the negative samples, along with a mini-batch (subset) of the positive data, to compute the gradient in Eq. (6) and update the parameters. This procedure belongs to the general class of Robbins-Monro stochastic approximation algorithms [Younes, 1989]. Under mild conditions, which include the requirement that the Markov chain operators leave $p(\mathbf{v}, \mathbf{h} \mid \theta)$ invariant, this procedure will converge to a stable point of $\mathcal{L}(\theta)$.
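The two alternating steps can be sketched for the binary RBM as follows. This is an illustrative NumPy sketch under stated assumptions: the function name `pcd_update`, the learning rate, and the toy data are hypothetical, and no penalty term or momentum is included.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def pcd_update(batch, chains, W, b, c, rng, lr=0.01, gibbs_steps=1):
    """One stochastic-approximation update; returns the advanced chains.

    batch:  (T, V) positive data; chains: (M, V) persistent negative samples.
    W (H, V), b (V,), c (H,) are updated in place.
    """
    # Step 1: advance the M persistent chains with block Gibbs, Eqs. (3)-(4).
    for _ in range(gibbs_steps):
        ph = sigm(c + chains @ W.T)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigm(b + h @ W)
        chains = (rng.random(pv.shape) < pv).astype(float)

    # Step 2: positive minus negative statistics, i.e. a descent step on the
    # negative log likelihood using the Monte Carlo gradient estimate.
    ph_pos = sigm(c + batch @ W.T)     # E[h | v_t], shape (T, H)
    ph_neg = sigm(c + chains @ W.T)    # E[h | v_m], shape (M, H)
    W += lr * (ph_pos.T @ batch / len(batch) - ph_neg.T @ chains / len(chains))
    b += lr * (batch.mean(axis=0) - chains.mean(axis=0))
    c += lr * (ph_pos.mean(axis=0) - ph_neg.mean(axis=0))
    return chains

# Toy usage with random binary "data" (illustration only).
rng = np.random.default_rng(0)
V, H, T, M = 8, 5, 20, 10
W = 0.01 * rng.standard_normal((H, V))
b, c = np.zeros(V), np.zeros(H)
data = (rng.random((T, V)) < 0.3).astype(float)
chains = (rng.random((M, V)) < 0.5).astype(float)
for _ in range(50):
    chains = pcd_update(data, chains, W, b, c, rng)
```

The chains are carried across parameter updates rather than restarted, which is what "maintaining M parallel Markov chains throughout learning" amounts to in code.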

For K-ary observations (observations belonging to a finite set of K discrete outcomes) we can use the same energy function as in Eq. (1) for the binary RBM by encoding each observation in a "one-hot" representation and concatenating the representations of all observations to construct v. In other words, for n separate K-ary observations, the visible units v will be partitioned into n groups of K binary units, with the constraint that each partition can only contain a single non-zero entry. Using the notation $\mathbf{v}_{a:b}$ to refer to the subvector of elements from index a to index b, the $i$th observation will then be encoded by the group of visible units $\mathbf{v}_{(i-1)K+1:iK}$. The one-hot encoding is enforced by constraining each group of units to contain only a single 1-valued unit, the others being set to 0. The difference between RBMs with binary and K-ary observations is illustrated in Figure 1.

To simplify notation, we refer to the $i$th group of visible units as $\mathbf{v}^{(i)} = \mathbf{v}_{(i-1)K+1:iK}$. Similarly, we will refer to the biases and weights associated with those units as $\mathbf{b}^{(i)} = \mathbf{b}_{(i-1)K+1:iK}$ and $\mathbf{W}^{(i)} = \mathbf{W}_{\cdot,(i-1)K+1:iK}$. We will also denote with $\mathbf{e}_k$ the one-hot vector with its $k$th component set to 1.

The conditional distribution over the visible layer is then 



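$$p(\mathbf{v} \mid \mathbf{h}) = \prod_{i=1}^{n} p(\mathbf{v}^{(i)} \mid \mathbf{h}), \qquad p(\mathbf{v}^{(i)} = \mathbf{e}_k \mid \mathbf{h}) = \frac{\exp(\mathbf{b}^{(i)\top} \mathbf{e}_k + \mathbf{h}^\top \mathbf{W}^{(i)} \mathbf{e}_k)}{\sum_{k'} \exp(\mathbf{b}^{(i)\top} \mathbf{e}_{k'} + \mathbf{h}^\top \mathbf{W}^{(i)} \mathbf{e}_{k'})}. \tag{7}$$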



Each group $\mathbf{v}^{(i)}$ has a multinomial distribution given the hidden layer. Because the multinomial probabilities are given by a softmax nonlinearity, the group of units $\mathbf{v}^{(i)}$ are referred to as softmax units [Salakhutdinov and Hinton, 2009].
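A small sketch of the one-hot encoding and of exact sampling from Eq. (7) for one group may help make the cost concrete. The helper names are hypothetical, and indices are 0-based here, whereas the text uses 1-based indices.

```python
import numpy as np

def one_hot_ngram(words, K):
    """Encode n word indices (0-based) as n concatenated one-hot groups of size K."""
    v = np.zeros(len(words) * K)
    for i, w in enumerate(words):
        v[i * K + w] = 1.0
    return v

def sample_group_exact(h, W_i, b_i, rng):
    """Draw a word index for one group from Eq. (7). Cost is linear in K.

    W_i: (H, K) weights for group i; b_i: (K,) biases for group i.
    """
    logits = b_i + h @ W_i          # one activation per vocabulary entry
    logits -= logits.max()          # stabilise the exponentials
    p = np.exp(logits)
    p /= p.sum()                    # normalising over all K outcomes
    return rng.choice(len(p), p=p), p

# Toy usage: a pair of 3-ary observations, as in Figure 1.
v = one_hot_ngram([2, 0], K=3)
rng = np.random.default_rng(0)
h = np.array([1.0, 0.0])
k, p = sample_group_exact(h, rng.standard_normal((2, 3)), np.zeros(3), rng)
```

Every call to `sample_group_exact` computes and normalizes K activations even though only one outcome is returned.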

3 Difficulties with Word Observations



While in the binary case the size of the visible layer is equal to the number of observations, in the K-ary case the size of the visible layer is K times the number of observations. For language processing applications, where K is the vocabulary size and can run into the hundreds of thousands, the visible layer can become unmanageably large.

The difficulty with large K is that the Gibbs operator on the visible units becomes expensive to simulate, 
making it difficult to perform updates of the negative samples. That is, generating a sample from the conditional 
distribution in Eq. (7) rapidly dominates the stochastic learning procedure as K increases. The reason for this 
expense is that it is necessary to compute the activity associated with each of the K possible outcomes, even 
though only a single one will actually be selected. 

On the other hand, given a mini-batch $\{\mathbf{v}_t\}$ and negative samples $\{\mathbf{v}_m\}$, the gradient computations in Eq. (6) are able to take advantage of the sparsity of the visible activity. Since each $\mathbf{v}_t$ and $\mathbf{v}_m$ only contain n non-zero entries, the cost of the gradient estimator has no dependence on K and can be rapidly computed. Thus the only barrier to efficient learning of high-dimensional multinomial RBMs is the complexity of the Gibbs update for the visible units.

Dealing with large multinomial distributions is an issue that has come up previously in work on neural 
network language models [Bengio et al., 2001]. For example, Morin and Bengio [2005] addressed this problem 
by introducing a fixed factorization of the (conditional) multinomial using a binary tree in which each leaf is 
associated with a single word. The tree was determined using an external knowledge base, although Mnih and 
Hinton [2009] investigated ways of extending this approach by learning the word tree from data. 

Unfortunately, tree-structured solutions are not applicable to the problem of modeling the joint distribution 
of n consecutive words, as we wish to do here. Introducing a directed tree breaks the undirected, symmetric 
nature of the interaction between the visible and hidden units of the RBM. While one strategy might be to use 
a conditional RBM to model the tree-based factorizations, similar to Mnih and Hinton [2009], the end result 
would not be an RBM model of n-gram word windows, nor would it even be a conditional RBM over the next 
word given the n − 1 previous ones.

In summary, dealing with K-ary observations in the Boltzmann machine framework for large K is a crucial open problem that has inhibited the development of deep learning solutions to NLP problems.

4 Metropolis-Hastings for Softmax Units

Having identified the Gibbs update of the visible units as the limiting factor in efficient learning of large-K multinomial observations, it is natural to examine whether other operators might be used for the Monte Carlo estimate in Eq. (6). In particular, we desire a transition operator that can take advantage of the same sparse operations that enable the gradient to be efficiently computed from the positive and negative samples, while still leaving $p(\mathbf{v}, \mathbf{h})$ invariant and thus still satisfying the convergence conditions of the stochastic approximation learning procedure.

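To achieve this, instead of sampling exactly from the conditionals $p(\mathbf{v}^{(i)} \mid \mathbf{h})$ within the Markov chain, we use a small number of iterations of Metropolis-Hastings (M-H) sampling. Let $q(\hat{\mathbf{v}}^{(i)} \leftarrow \mathbf{v}^{(i)})$ be a proposal distribution for group i. The following stochastic operator leaves $p(\mathbf{v}, \mathbf{h})$ invariant:

1. Given the current visible state $\mathbf{v}$, sample a proposal $\hat{\mathbf{v}}$ for group i, such that $\hat{\mathbf{v}}^{(i)} \sim q(\hat{\mathbf{v}}^{(i)} \leftarrow \mathbf{v}^{(i)})$ and $\hat{\mathbf{v}}^{(j)} = \mathbf{v}^{(j)}$ for $i \neq j$ (i.e. sample a proposed new word for position i).

2. Replace the ith part of the current state $\mathbf{v}^{(i)}$ with $\hat{\mathbf{v}}^{(i)}$ with probability:

$$\min\left\{ 1, \; \frac{q(\mathbf{v}^{(i)} \leftarrow \hat{\mathbf{v}}^{(i)}) \, \exp(\mathbf{b}^{(i)\top} \hat{\mathbf{v}}^{(i)} + \mathbf{h}^\top \mathbf{W}^{(i)} \hat{\mathbf{v}}^{(i)})}{q(\hat{\mathbf{v}}^{(i)} \leftarrow \mathbf{v}^{(i)}) \, \exp(\mathbf{b}^{(i)\top} \mathbf{v}^{(i)} + \mathbf{h}^\top \mathbf{W}^{(i)} \mathbf{v}^{(i)})} \right\}$$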



Assuming that it is possible to efficiently sample from the proposal distribution $q(\hat{\mathbf{v}}^{(i)} \leftarrow \mathbf{v}^{(i)})$, this M-H operator is fast to compute since it does not require normalizing over all possible values of the visible units in group i and, in fact, only requires the unnormalized probability of one of them. Moreover, since the n visible groups are conditionally independent given the hiddens, the M-H transition for each group can be simulated in parallel (i.e. words are sampled at every position separately). The efficiency of these operations makes it possible to apply this transition operator many times before moving on to other parts of the learning and still obtain a large speedup over exact sampling from the conditional.
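The two-step operator can be sketched for a single group as follows. This is a minimal illustration assuming an independence proposal, so that the proposal probability depends only on the proposed word; the names `mh_step`, `unnorm_logp`, and `draw_proposal`, and the toy proposal vector `q`, are hypothetical.

```python
import numpy as np

def unnorm_logp(k, h, W_i, b_i):
    """log of exp(b_i^T e_k + h^T W_i e_k): touches one column of W_i only."""
    return b_i[k] + h @ W_i[:, k]

def mh_step(k, h, W_i, b_i, q, draw_proposal, rng):
    """One M-H transition for group i with an independence proposal q.

    k is the current word index; q[k] is the proposal probability of word k.
    draw_proposal(rng) is assumed to return a word index distributed as q.
    """
    k_new = draw_proposal(rng)
    # Acceptance ratio q(k) * ptilde(k_new) / (q(k_new) * ptilde(k)), in logs.
    log_ratio = (np.log(q[k]) + unnorm_logp(k_new, h, W_i, b_i)
                 - np.log(q[k_new]) - unnorm_logp(k, h, W_i, b_i))
    if np.log(rng.random()) < min(0.0, log_ratio):
        return k_new
    return k

# Toy check: the long-run frequencies should approach the softmax of Eq. (7).
rng = np.random.default_rng(0)
H, K = 3, 5
W_i, b_i = rng.standard_normal((H, K)), rng.standard_normal(K)
h = np.array([1.0, 0.0, 1.0])
q = np.array([0.4, 0.3, 0.15, 0.1, 0.05])   # stand-in for corpus word frequencies
draw = lambda r: r.choice(K, p=q)           # O(K) here; a constant-time sampler would replace it
k, counts = 0, np.zeros(K)
for _ in range(20000):
    k = mh_step(k, h, W_i, b_i, q, draw, rng)
    counts[k] += 1
target = np.exp(b_i + h @ W_i)
target /= target.sum()
```

Note that `mh_step` never forms the K-way normalizer; `target` is computed here only to check the toy chain.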

4.1 Efficient Sampling of Proposed Words 

The utility of M-H sampling for an RBM with word observations relies on the fact that sampling from the proposal $q(\hat{\mathbf{v}}^{(i)} \leftarrow \mathbf{v}^{(i)})$ is much more efficient than sampling from the correct softmax multinomial. Although there are many possibilities for designing such proposals, here we will explore the most basic variant: independence chain Metropolis-Hastings in which the proposal distribution is fixed to be the marginal distribution over words in the corpus.

Naive procedures for sampling from discrete distributions typically have linear time complexity in the number of outcomes. However, the alias method of Kronmal and Peterson [1979] can be used to generate samples in constant time with linear setup time. While the alias method would not help us construct a Gibbs sampler for the visibles, it does make it possible to generate proposals extremely efficiently, which we can then use to simulate the Metropolis-Hastings operator, regardless of the current target distribution.

The alias method leverages the fact that any K-valued discrete distribution can be written as a mixture, with uniform mixing weights, of K distributions, each over only two of the original possible outcomes. Having constructed this mixture distribution at setup time (with linear time and space cost), new samples can be generated in two steps: 1) sample uniformly from the K mixture components, and 2) sample from that component's Bernoulli distribution.
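The two-step sampling scheme can be sketched in plain Python. The table construction below uses the common small/large worklist approach (often attributed to Vose), which is an assumption on our part and may differ in detail from the construction in Kronmal and Peterson [1979]; the function names are illustrative.

```python
import random

def build_alias_table(probs):
    """Linear-time setup: returns (prob, alias) tables of length K."""
    K = len(probs)
    scaled = [p * K for p in probs]
    prob, alias = [0.0] * K, [0] * K
    small = [k for k, s in enumerate(scaled) if s < 1.0]
    large = [k for k, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l   # component s: outcome s w.p. prob[s], else l
        scaled[l] -= 1.0 - scaled[s]       # mass of l used to fill component s
        (small if scaled[l] < 1.0 else large).append(l)
    for k in small + large:                # leftovers are full components (up to rounding)
        prob[k], alias[k] = 1.0, k
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """Constant-time draw: pick a component uniformly, then flip its biased coin."""
    k = rng.randrange(len(prob))
    return k if rng.random() < prob[k] else alias[k]

# Toy usage with a 4-word "unigram distribution".
unigram = [0.5, 0.3, 0.15, 0.05]
prob, alias = build_alias_table(unigram)
sample = alias_draw(prob, alias, random.Random(0))
```

Each of the K components holds at most two outcomes, so a draw costs one uniform integer and one uniform float regardless of K.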

4.2 Mixing of Metropolis-Hastings

Although this procedure eliminates dependence of the learning algorithm on K, it is important to examine the 
mixing of Metropolis-Hastings and how sensitive it is to K in practice. Although there is evidence [Hinton, 2000] that poorly-mixing Markov chains can yield good learning signals, when this will occur is not as well understood.
We examined the mixing issue using the model described in Section 6.1 with the parameters learned from the 
Gigaword corpus with a 100,000-word vocabulary as described in Section 6.2. 

We analytically computed the distributions implied by iterations of the M-H operator, assuming the initial state was drawn according to $\prod_i q(\mathbf{v}^{(i)})$. As this computation requires the instantiation of n 100k × 100k matrices, it cannot be done at training time, but was done offline for analysis purposes. Each application of Metropolis-Hastings results in a new distribution converging to the target (true) conditional.
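The same kind of analytic computation can be illustrated at toy scale for a single group. The sketch below assumes an independence proposal and a vocabulary small enough for the K × K transition matrix to fit in memory; the function names and the randomly generated target and proposal are illustrative, not the trained model's.

```python
import numpy as np

def mh_transition_matrix(p, q):
    """Row-stochastic matrix T[k, k2] of independence-chain M-H with target p, proposal q."""
    w = p / q                                           # importance weights
    accept = np.minimum(1.0, w[None, :] / w[:, None])   # accept[k, k2] for move k -> k2
    T = q[None, :] * accept
    np.fill_diagonal(T, 0.0)
    np.fill_diagonal(T, 1.0 - T.sum(axis=1))            # rejected mass stays put
    return T

def tv_curve(p, q, steps):
    """Total variation distance to p after 0..steps M-H iterations, starting from q."""
    T, mu, out = mh_transition_matrix(p, q), q.copy(), []
    for _ in range(steps + 1):
        out.append(0.5 * np.abs(mu - p).sum())
        mu = mu @ T
    return out

# Toy usage: K = 50 instead of 100k.
rng = np.random.default_rng(0)
K = 50
p = np.exp(2.0 * rng.standard_normal(K)); p /= p.sum()   # toy "true conditional"
q = rng.random(K) + 0.1; q /= q.sum()                     # toy proposal
curve = tv_curve(p, q, steps=100)
```

Because the matrix has K² entries, the same computation at K = 100,000 needs on the order of 10^10 entries per group, which is why it is only feasible offline.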

Figures 2(a) and 2(b) show this convergence for the "reconstruction" distributions of six randomly-chosen 5-grams from the corpus, using two metrics: symmetric Kullback-Leibler (KL) divergence and total variation (TV) distance, which is the standard measure for analysis of MCMC mixing. The TV distance shown is the mean across the five group distributions. Figures 2(c) and 2(d) show these metrics broken down by grouping, for the slowest curves (dark green) of the top two figures. These curves highlight that the state of the hidden units has a strong impact on the mixing and that most groups mix very quickly while a few converge slowly. We feel that these curves, along with the results of Section 6, indicate that the mixing is effective, but could benefit from further study.

Figure 2: Convergence of the Metropolis-Hastings operator to the true conditional distribution over the visibles for a trained 5-gram RBM with a vocabulary of 100K words. These curves were computed analytically. (a) Curves for six randomly-chosen data cases, measured in KL divergence. (b) Curves for the same six cases, measured in average total variation distance across the five words. (c, d) For the slowest-converging of the six top curves (dark green), these are broken down for each of the five multinomials in KL and total variation, respectively.

5 Related Work 

Using M-H sampling for a multinomial distribution with softmax probabilities has been explored in the context 
of a neural network language model by Bengio and Senecal [2003]. They used M-H to estimate the training 
gradient at the output of the neural network. However, their work did not address or investigate its use in the 
context of Boltzmann machines in general. 

Salakhutdinov and Hinton [2009] describe an alternative to directed topic models called the replicated softmax RBM that uses softmax units over the entire vocabulary with tied weights to model an unordered collection of words (a.k.a. bag of words). Since their RBM ties the weights to all the words in a single document, there is only one vocabulary-sized multinomial distribution to compute per document, instead of the n required when modeling a window of consecutive words. Therefore sampling a document conditioned on the hidden variables of the replicated softmax still incurs a computational cost linear in K, although the problem is not amplified by a factor of n as it is here. Notably, Salakhutdinov and Hinton [2009] limited their vocabulary to K < 14,000 at the maximum.

No known previous work has attempted to address the computational burden associated with K-ary observations with large K in RBMs. The M-H-based approach used here is not specific to a particular Boltzmann
machine and could be used for any model with large softmax units, although the applications that motivate us 
come from NLP. Dealing with the large softmax problem is essential if Boltzmann machines are to be practical 
for natural language data. 

In Section 6, we present results on the task of learning word representations. This task has been investigated previously by others. Turian et al. [2010] provide an overview and evaluation of these different methods, including those of Mnih and Hinton [2009] and of Collobert and Weston [2008]. We have already mentioned the work of Mnih and Hinton [2009], who model the conditional distribution of the last word in n-gram windows. Collobert and Weston [2008] follow a similar approach, training a neural network to fill in the middle word of an n-gram window using a margin-based learning objective. In contrast, we model the joint distribution of the whole n-gram window, which implies that the RBM could be used to fill in any word within a window. Moreover, inference with an RBM yields a hidden representation of the whole window and not simply of a single word.

6 Experiments 

We evaluated the success of our proposed M-H approach to training RBMs on two NLP tasks: chunking and 
sentiment classification. Both applications will be based on the same RBM model over n-gram windows of words, 
hence we first describe the parameterization of this RBM and later present how it was used for chunking and 
sentiment classification. Both applications also take advantage of the model's ability to learn useful feature 
vectors for entire n-grams, not just individual words. 

6.1 RBM Model of n-gram Windows 

In its standard parameterization presented in Section 2, the RBM uses separate weights (i.e., different columns 
of W) to model the statistics of observations at different positions. When training an RBM on word n-gram 
windows, we would prefer to share some parameters across identical words in different positions in the window 
and factor the weights into position-dependent weights and position-independent weights (word representations). 

Therefore, we use a parameterization of an RBM with word observations very similar to that in Mnih and Hinton [2007], which itself is inspired by previous work on neural language models [Bengio et al., 2001]. The idea is to learn, for each possible word w, a lower-dimensional linear projection of its one-hot encoding by incorporating the projection directly in the energy function of the RBM. Moreover, we share this projection across positions within the n-gram window. Let $\mathbf{D}$ be the matrix of this linear projection and let $\mathbf{e}_w$ be the one-hot representation of w (where we treat w as an integer index in the vocabulary). Performing this projection $\mathbf{D} \mathbf{e}_w$ is equivalent to selecting the appropriate column $\mathbf{D}_{\cdot,w}$ of this matrix. This column vector can then be seen as a real-valued vector representation of that word. The real-valued vector representations of all words within the n-gram are then concatenated and connected to the hidden layer with a single weight matrix.

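More specifically, let $\mathbf{D}$ be the $D \times K$ matrix of word representations. These word representations are introduced by reparameterizing $\mathbf{W}^{(i)} = \mathbf{U}^{(i)} \mathbf{D}$, where $\mathbf{U}^{(i)}$ is a position-dependent $H \times D$ matrix. The biases across positions are also shared, i.e., we learn a single bias vector $\mathbf{b}^*$ that is used at all positions ($\mathbf{b}^{(i)} = \mathbf{b}^* \; \forall i$). The energy function becomes

$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{c}^\top \mathbf{h} + \sum_{i=1}^{n} \left( -\mathbf{b}^{*\top} \mathbf{v}^{(i)} - \mathbf{h}^\top \mathbf{U}^{(i)} \mathbf{D} \mathbf{v}^{(i)} \right)$$

with conditional distributions

$$p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v}), \qquad p(h_j = 1 \mid \mathbf{v}) = \mathrm{sigm}\Big(c_j + \sum_{i=1}^{n} \mathbf{U}^{(i)}_{j\cdot} \mathbf{D} \mathbf{v}^{(i)}\Big), \tag{8}$$

$$p(\mathbf{v} \mid \mathbf{h}) = \prod_{i=1}^{n} p(\mathbf{v}^{(i)} \mid \mathbf{h}), \qquad p(\mathbf{v}^{(i)} = \mathbf{e}_k \mid \mathbf{h}) = \frac{\exp(\mathbf{b}^{*\top} \mathbf{e}_k + \mathbf{h}^\top \mathbf{U}^{(i)} \mathbf{D} \mathbf{e}_k)}{\sum_{k'} \exp(\mathbf{b}^{*\top} \mathbf{e}_{k'} + \mathbf{h}^\top \mathbf{U}^{(i)} \mathbf{D} \mathbf{e}_{k'})},$$

where $\mathbf{U}^{(i)}_{j\cdot}$ refers to the $j$th row vector of $\mathbf{U}^{(i)}$. The gradients with respect to this parameterization are easily derived from Eq. (6). We refer to this construction as a word representation RBM (WRRBM).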
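The factored parameterization means the hidden activations can be computed from n column lookups in the word representation matrix, without ever forming a K-sized product. The sketch below illustrates Eq. (8) under assumed array layouts (`D` of shape (d, K), `U` of shape (n, H, d)); the function names are hypothetical.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_probs(words, D, U, c):
    """p(h_j = 1 | v) from Eq. (8) for one n-gram of 0-based word indices.

    D: (d, K) word representations, U: (n, H, d) position-dependent weights, c: (H,).
    D v^(i) is just column words[i] of D, so the cost does not depend on K.
    """
    act = c.copy()
    for i, w in enumerate(words):
        act += U[i] @ D[:, w]
    return sigm(act)

def group_logit(k, i, h, D, U, b_star):
    """Unnormalized log-probability of word k at position i: b*_k + h^T U^(i) D e_k."""
    return b_star[k] + h @ U[i] @ D[:, k]

# Toy usage: a trigram over a 6-word vocabulary with 2-dimensional representations.
rng = np.random.default_rng(1)
n, H, d, K = 3, 4, 2, 6
D, U = rng.standard_normal((d, K)), rng.standard_normal((n, H, d))
c, b_star = rng.standard_normal(H), rng.standard_normal(K)
words = [5, 0, 3]
ph = hidden_probs(words, D, U, c)
```

`group_logit` is the quantity an M-H acceptance test needs for the current and proposed word; it also touches only one column of `D`.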

Rather than training the WRRBM conditionally to model $p(w_{t+n-1} \mid w_t, \ldots, w_{t+n-2})$ as in Mnih and Hinton [2007], we train it using Metropolis-Hastings to model the full joint distribution $p(w_t, \ldots, w_{t+n-1})$. That is, we train the WRRBM based on the objective

$$\mathcal{L}(\theta) = -\sum_t \log p\big(\mathbf{v}^{(1)} = \mathbf{e}_{w_t}, \mathbf{v}^{(2)} = \mathbf{e}_{w_{t+1}}, \ldots, \mathbf{v}^{(n)} = \mathbf{e}_{w_{t+n-1}}\big)$$

using stochastic approximation from M-H sampling of the word observations. For models with n > 2, we also found it helpful to incorporate $\ell_2$ regularization of the weights, and to use momentum when updating $\mathbf{U}^{(i)}$.



6.2 Chunking Task 

As described by Turian et al. [2010], learning real-valued word representations can be used as a simple way of
performing semi-supervised learning for a given method, by first learning word representations on unlabeled text 
and then feeding these representations as additional features to a supervised learning model. 

We trained the WRRBM on windows of text derived from the English Gigaword corpus¹. The dataset is a corpus of newswire text from a variety of sources. We extracted each news story and trained only on windows of n words that did not cross the boundary between two different stories. We used NLTK [Bird et al., 2009] to tokenize the words and sentences, and also corrected a few common punctuation-related tokenization errors. As in Collobert et al. [2011], we lowercased all words and delexicalized numbers (replacing consecutive occurrences of one or more digits inside a word with just a single # character). Unlike Collobert et al. [2011], we did not include additional capitalization features, but discarded all capitalization information. We used a vocabulary consisting of the 100,000 most frequent words plus a special "unknown word" token to which all remaining words were mapped.

¹ http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T12

Model                                Valid    Test
CRF w/o word representations         94.16    93.79
HLBL [Mnih and Hinton, 2009]         94.63    94.00
C&W [Collobert and Weston, 2008]     94.66    94.10
Brown clusters                       94.67    94.11
WRRBM                                94.82    94.10
WRRBM (with hidden units)            95.01    94.44

Table 1: Experimental results on the chunking task for various baselines and our approach (WRRBM). The baseline results were taken from Turian et al. [2010]. The performance measure is F1.
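The token normalization described for the Gigaword preprocessing (lowercasing, replacing each run of digits with a single #, and mapping out-of-vocabulary words to an unknown-word token) can be sketched as below. The function names and the literal token `UNK` are assumptions for illustration; tokenization itself (done with NLTK in the text) is not shown.

```python
import re

def normalize_token(token):
    """Lowercase a token and replace each run of one or more digits with a single '#'."""
    return re.sub(r"\d+", "#", token.lower())

def map_to_vocab(tokens, vocab, unk="UNK"):
    """Map every token outside the vocabulary to the special unknown-word token."""
    return [t if t in vocab else unk for t in tokens]

# Toy usage with a 3-word vocabulary.
tokens = [normalize_token(t) for t in ["The", "B52s", "sold", "1990", "tickets"]]
mapped = map_to_vocab(tokens, vocab={"the", "sold", "#"})
```

In the setting described in the text, `vocab` would hold the 100,000 most frequent normalized words.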

We evaluated the learned WRRBM word representations on a chunking task, following the setup described in Turian et al. [2010] and using the associated publicly available code, as well as CRFsuite². As in Turian et al. [2010], we used data from the CoNLL-2000 shared task. We used a scale of 0.1 for the word representation features (as Turian et al. [2010] recommend) and, for each WRRBM model, tried $\ell_2$ penalties $\lambda \in \{0.0001, 1.2, 2.4, 3.2\}$ for CRF training. We selected the single model with the best validation F1 score over all runs and evaluated it on the test set. The model with the best validation F1 score used 3-gram word windows, $\lambda = 1.2$, 250 hidden units, a learning rate of 0.01, and used 100 steps of M-H sampling to update each word observation in the negative data.

The results are reported in Table 1, where we observe that word representations learned by our model achieved 
higher validation and test scores than the baseline of not using word representation features, and are comparable 
to the best of the three word representations tried in Turian et al. [2010]³.

Although the word representations learned by our model are highly effective features for chunking, an important advantage of our model over many other ways of inducing word representations is that it also naturally produces a feature vector for the entire n-gram. For the trigram model mentioned above, we also tried adding the hidden unit activation probability vector as a feature for chunking. For each word $w_i$ in the input sentence, we generated features using the hidden unit activation probabilities for the trigram $w_{i-1} w_i w_{i+1}$. No features were generated for the first and last word of the sentence. The hidden unit activation probability features improved validation set F1 to 95.01 and test set F1 to 94.44, a test set result superior to all word embedding results on chunking reported in Turian et al. [2010].

As can be seen in Table 2, the learned word representations capture meaningful information about words. However, the model primarily learns word representations that capture syntactic information (as do the representations studied in Turian et al. [2010]), as it only models short windows of text and must enforce local agreement.

² http://www.chokkan.org/software/crfsuite/

³ Better results have been reported by others for this dataset, when using a larger vocabulary. The spectral approach of Dhillon et al. [2011] used a vocabulary of 300,000 words and obtains higher F1 scores than the methods evaluated in Turian et al. [2010]. Unfortunately, Dhillon et al. [2011] did not test their method with a vocabulary of 100,000 words as in Turian et al. [2010], making a direct comparison impossible.

could   spokeswoman        suspects       science      china      mother       sunday
---------------------------------------------------------------------------------------
should  spokesman          defendants     sciences     japan      father       saturday
would   lawyer             detainees      medicine     taiwan     daughter     friday
will    columnist          hijackers      research     thailand   son          monday
can     consultant         attackers      economics    russia     grandmother  thursday
might   secretary-general  demonstrators  engineering  indonesia  sister       wednesday
must    strategist         inmates        arts         iran       grandfather  tuesday
did     negotiator         assailants     psychology   india      brother      yesterday
wo      administrator      atrocities     journalism   nigeria    girlfriend   today
does    correspondent      dissidents     privacy      greece     husband      tomorrow
ca      adviser            killings       nutrition    vietnam    cousin       tonight

tom     actually    probably    quickly      earned     what        hotel
---------------------------------------------------------------------------------------
jim     finally     certainly   easily       averaged   why         restaurant
bob     definitely  definitely  slowly       clinched   how         theater
kevin   rarely      hardly      carefully    retained   whether     casino
brian   eventually  usually     effectively  regained   whatever    ranch
steve   hardly      actually    frequently   grabbed    where       zoo
chris   ultimately  surely      badly        netted     something   cafe
david   basically   simply      seriously    saved      whom        tribune
robert  usually     apparently  quietly      secured    nothing     warehouse
joe     somehow     obviously   strongly     enjoyed    everything  symphony
ron     suddenly    clearly     closely      surpassed  neither     nightclub

Table 2: The ten nearest neighbors (in the word feature vector space) of some sample words.



Nevertheless, word representations capture some semantic information, but only after similar syntactic roles have 
been enforced. Although not shown in Table 2, the model consistently embeds the following natural groups of 
words together (maintaining small intra-group distances): days of the week, words for single digit numbers, 
months of the year, and abbreviations for months of the year. A 2D visualization of the word representations 
generated by t-SNE [van der Maaten and Hinton, 2008] is provided at http://i.imgur.com/Zbrz0.png.

6.3 Sentiment Classification Task 

Maas et al. [2011] describe a model designed to learn word representations specifically for sentiment analysis. 
They train a probabilistic model of documents that is capable of learning word representations and leveraging 
sentiment labels in a semi-supervised framework. Even without using the sentiment labels, by treating each 
document as a single bag of words, their model tends to learn distributed representations for words that capture mostly semantic information since the co-occurrence of words in documents encodes very little syntactic
information. To get the best results on sentiment classification, they combined features learned by their model 
with bag-of-words feature vectors (normalized to unit length) using binary term frequency weights (referred to 
as "bnc"). 

We applied the WRRBM to the problem of sentiment classification by treating a document as a "bag of n-grams", as this maps well onto the fixed-window model for text. At first glance, a word representation RBM might not seem to be a suitable model for learning features to improve sentiment classification. A WRRBM trained on the phrases "this movie is wonderful" and "this movie is atrocious" will learn that the word "wonderful" and the word "atrocious" can appear in similar contexts and thus should have vectors near each other, even though they should be treated very differently for sentiment analysis. However, a class-conditional model that trains separate WRRBMs on n-grams from documents expressing positive and negative sentiment avoids this problem.

Model                                                                       Test
LDA                                                                         67.42
LSA                                                                         83.96
Maas et al. [2011]'s "full" method                                          87.44
Bag of words "bnc"                                                          87.80
Maas et al. [2011]'s "full" method + bag of words "bnc"                     88.33
Maas et al. [2011]'s "full" method + bag of words "bnc" + unlabeled data    88.89
WRRBM                                                                       87.42
WRRBM + bag of words "bnc"                                                  89.23

Table 3: Experimental results on the sentiment classification task for various baselines and our approach (WRRBM). The baseline results were taken from Maas et al. [2011]. The performance measure is accuracy (%).

We trained class-specific, 5-gram WRRBMs on the labeled documents of the Large Movie Review dataset 
introduced by Maas et al. [2011], independently parameterizing words that occurred at least 235 times in the 
training set (giving us approximately the same vocabulary size as the model used in Maas et al. [2011]).

To label a test document using the class-specific WRRBM, we fit a threshold to the difference between the 
average free energies assigned to n-grams in the document by the positive-sentiment and negative sentiment 
models. We explored a variety of different hyperparameters (number of hidden units, training parameters, and 
n) for the pairs of WRRBMs and selected the WRRBM pair giving best training set classification performance. 
This WRRBM pair yielded 87.42% accuracy on the test set. 
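The free-energy comparison can be sketched as follows. The free energy used is the standard closed form for an RBM with binary hidden units, written for the WRRBM parameterization of Section 6.1; the function names, the array layouts, and the default threshold are assumptions made for illustration rather than the exact procedure used in the experiments.

```python
import numpy as np

def free_energy(words, D, U, c, b_star):
    """F(v) = -sum_i b*_{w_i} - sum_j log(1 + exp(c_j + sum_i U^(i)_j D_{., w_i}))."""
    act = c + sum(U[i] @ D[:, w] for i, w in enumerate(words))
    return -sum(b_star[w] for w in words) - np.logaddexp(0.0, act).sum()

def doc_score(doc, n, pos_model, neg_model):
    """Average n-gram free energy under the negative model minus that under the positive one."""
    grams = [doc[t:t + n] for t in range(len(doc) - n + 1)]
    f_pos = np.mean([free_energy(g, *pos_model) for g in grams])
    f_neg = np.mean([free_energy(g, *neg_model) for g in grams])
    return f_neg - f_pos    # larger means the positive model assigns lower free energy

def classify(doc, n, pos_model, neg_model, threshold=0.0):
    """Label a document of word indices; the threshold would be fit on training data."""
    return "pos" if doc_score(doc, n, pos_model, neg_model) > threshold else "neg"

# Toy usage with random parameters; each model is a tuple (D, U, c, b_star).
rng = np.random.default_rng(2)
n, H, d, K = 2, 3, 2, 4
make = lambda: (rng.standard_normal((d, K)), rng.standard_normal((n, H, d)),
                rng.standard_normal(H), rng.standard_normal(K))
pos_model, neg_model = make(), make()
label = classify([1, 3, 0, 2], n, pos_model, neg_model)
```

The two models have different partition functions, so the free energies are comparable only up to a constant offset; fitting the threshold absorbs that offset.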

We additionally examined the performance gain by appending to the bag-of-words features the average n-gram 
free energies under both class-specific WRRBMs. The bag-of-words feature vector was weighted and normalized 
as in Maas et al. [2011] and the average free energies were scaled to lie on [0, 1]. We then trained a linear SVM 
to classify documents based on the resulting document feature vectors, giving us 89.23% accuracy on the test 
set. This result is the best known result on this benchmark and, notably, our method did not make use of the 
unlabeled data. 

7 Conclusion 

We have described a method for training RBMs with large K-ary softmax units that results in weight updates with a computational cost independent of K, allowing for efficient learning even when K is large. Using our method, we were able to train RBMs that learn meaningful representations of words and n-grams. Our results demonstrated the benefits of these features for chunking and sentiment classification and, given these successes, we are eager to try RBM-based models on other NLP tasks. Although the simple proposal distribution we used for M-H updates in this work is effective, exploring more sophisticated proposal distributions is an exciting prospect for future work.